Crawling Methods for Web Data of Various Formats Using Python

Li Seung; Sujin Yun; Young Woon Woo

Proceedings of the Korea Institute of Information and Communication Engineering (한국정보통신학회) Conference, 2019 Fall Conference

Korean Title	파이썬을 이용한 다양한 형식의 웹 데이터 크롤링 기법
English Title	Crawling Methods for Web Data of Various Formats Using Python
Authors	Li Seung (승리), Sujin Yun (윤수진), Young Woon Woo (우영운)
Citation	Vol. 23, No. 2, pp. 343-346 (October 2019)
Abstract	This paper proposes several techniques for automatically collecting web data published in varied formats such as cafes (online communities) and blogs. The techniques were implemented in Python with the Beautiful Soup library, which supports HTML selectors and the proposed customized collection techniques, and the resulting programs automatically collected all of the text data posted on cafes and blogs with special page structures. With the proposed techniques, a Python web-crawling program could also automatically collect large amounts of data from other specially structured web pages. The authors expect the collected data to be useful for implementing chatbots that need diverse conversational knowledge and for big data analysis research.
Keywords	Web crawling; Python; BeautifulSoup; HTML selector; Big data
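
This record does not reproduce the paper's code, so the following is only a minimal sketch of the kind of selector-based text extraction the abstract describes, using Python and Beautiful Soup. The sample markup, the CSS selectors (`div.post`, `.title`, `.body`), and the function name `extract_posts` are all hypothetical; a real cafe or blog page would need selectors matched to its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched cafe/blog page.
SAMPLE_HTML = """
<div id="board">
  <div class="post">
    <h2 class="title">First post</h2>
    <div class="body"><p>Hello,</p><p>world.</p></div>
  </div>
  <div class="post">
    <h2 class="title">Second post</h2>
    <div class="body"><p>Another entry.</p></div>
  </div>
  <div class="ad"><p>Sponsored text to skip.</p></div>
</div>
"""


def extract_posts(html, post_selector="div.post",
                  title_selector=".title", body_selector=".body"):
    """Return the title and body text of every element matching post_selector.

    The selectors are assumptions for this sample; each target site would
    need its own "custom" selectors.
    """
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.select(post_selector):
        title = node.select_one(title_selector)
        body = node.select_one(body_selector)
        posts.append({
            # get_text(strip=True) drops surrounding whitespace in each string
            "title": title.get_text(strip=True) if title else "",
            # join the text of nested tags with a single space
            "body": body.get_text(" ", strip=True) if body else "",
        })
    return posts


posts = extract_posts(SAMPLE_HTML)
for post in posts:
    print(post["title"], "|", post["body"])
```

Passing the selectors as parameters keeps the parsing logic unchanged when the page layout differs; only the selector strings need to be adapted per site.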